Guidelines for normalising Early Modern English corpora: Decisions and justifications
Abstract
Corpora of Early Modern English have been collected and released for research for a number of years. With large-scale digitisation gathering pace over the last decade, much more historical textual data is now available for research on numerous topics, including historical linguistics and conceptual history. We summarise previous research showing that historical spelling variants must be mapped to modern equivalents before natural language processing and corpus linguistics methods can be applied successfully. Manual and semi-automatic methods have been devised to support this normalisation and standardisation process. We argue that good results depend on a linguistically meaningful rationale for the process. To that end, we propose a number of guidelines for normalising corpora and show how they have been applied in the Corpus of English Dialogues.
Similar resources
VARD 2: A tool for dealing with spelling variation in historical corpora
Spelling variation causes considerable problems for corpus linguistic techniques such as frequency analysis, concordancing and automatic tagging, with a significant impact being made on recall and the accuracy of results [1]. This paper will focus on Early Modern English, the most recent period of the English language to include a large amount of inconsistent spelling. Although many corpora of ...
Normalising the IJS-ELAN Slovene-English Parallel Corpus for the Extraction of Multilingual Terminology
Various efforts have been made for the development of tools and methods dedicated to the automatic processing of multilingual terminology databases. For that purpose, multilingual parallel corpora have been used as a basis resource. However, most of the neologisms in technical and scientific domains are realised by multiword terms that are rarely identified in parallel corpora. In this paper, w...
Comparative Study of the Academic Vocabulary Content of Electronic Engineering Corpora, GE Materials and M.S. Entrance Examinations
The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...
From semi-automatic to automatic affix extraction in Middle English corpora: Building a sustainable database for analyzing derivational morphology over time
The annotation of large corpora is usually restricted to syntactic structure and word class. Pure lexical information and information on the structure of words are stored in specialized dictionaries (Baayen et al., 1995). Both data structures – dictionary and text corpus – can be matched to get e.g. a distribution of certain (restricted) lexical information from a text. This procedure works fin...
Move Structures in “Statement-of-the-Problem” Sections of M.A. Theses: The Case of Native and Nonnative Speakers of English
Understanding how to structure the “Statement-of-the-Problem” (SP) section of a thesis is necessary for EFL students to develop a logical argumentation for a problem statement. This study intended to compare Move structures of SP sections of theses written by native speakers of Persian (NSPs) and English (NSEs). To this end, 100 SP sections (50 SP sections written by NSE...
Publication date: 2015